AI Consulting Toolkit

Architecture Decision Sheet

A comprehensive reference for selecting the right tools at each layer of an AI system, from MVP to production.

Covers 12 architecture layers. Each tool is tagged by phase (MVP, PROD, or BOTH) and by type (AZURE, OSS, SaaS).
🧠 AI Backbone: LLM Provider
Core intelligence layer: model selection, API provider, deployment model

💡 Decision principle: Choose your orchestration framework first, LLM provider second. Most frameworks are model-agnostic. For MVP, start with hosted APIs. For production, evaluate latency, cost per token, data sovereignty, and fine-tuning needs.

GPT-4o / GPT-4.1 [SaaS API] (BOTH)
OpenAI: multimodal (text/image/audio)
Why: Best-in-class reasoning, massive ecosystem, vision built in. Go-to for complex agent tasks, code generation, document analysis.
Tradeoffs: ⚠ Data leaves your infra. Cost scales with tokens. Rate limits on free tiers.
Complexity: LOW · Alternatives: Claude 3.5, Gemini 1.5 Pro

Claude 3.5 / Claude 4 [SaaS API] (BOTH)
Anthropic: 200k context, strong reasoning
Why: Best for long-document analysis, instruction following, and low hallucination. The 200k context window is unmatched for RAG with large documents.
Tradeoffs: ⚠ No self-hosting. Limited fine-tuning options.
Complexity: LOW · Alternatives: GPT-4o, Gemini

Azure OpenAI Service [Managed, AZURE] (BOTH)
GPT-4o on Azure infra: enterprise compliance
Why: Data stays in your Azure tenant. HIPAA/SOC 2 compliant. Required for enterprise/gov clients. PTU for guaranteed throughput in prod.
Tradeoffs: ⚠ Slower model updates than OpenAI direct. Needs Azure subscription setup.
Complexity: MED · Alternatives: OpenAI direct API

Llama 3 / Mistral [OSS] (PROD)
Self-hosted open-source models
Why: Zero per-token cost at scale. Full data sovereignty. Fine-tunable for domain-specific tasks. Deploy on your GPU infra or AKS.
Tradeoffs: ⚠ Needs GPU infra. Ops overhead. Smaller context. Weaker reasoning than GPT-4-class models.
Complexity: HIGH · Alternatives: Phi-3, Gemma 2

vLLM / Ollama [OSS Runtime] (PROD)
Self-hosted LLM inference server
Why: High-throughput batched inference for production. vLLM: best for multi-user APIs. Ollama: best for local dev and edge deployment.
Tradeoffs: ⚠ Infra management required. GPU costs.
Complexity: HIGH · Alternatives: TGI (Hugging Face), LMDeploy
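The cost-per-token criterion above is easy to quantify before committing to a provider. A minimal sketch of a monthly-cost model follows; the token volumes and per-1K prices are illustrative placeholders, not current vendor rates.

```python
# Rough monthly-cost model for comparing hosted LLM pricing tiers.
# All prices here are ILLUSTRATIVE placeholders, not real vendor rates.

def monthly_cost_usd(input_tokens_per_day: int,
                     output_tokens_per_day: int,
                     price_in_per_1k: float,
                     price_out_per_1k: float,
                     days: int = 30) -> float:
    """Estimate monthly spend from average daily token volume."""
    daily = (input_tokens_per_day / 1000) * price_in_per_1k \
          + (output_tokens_per_day / 1000) * price_out_per_1k
    return round(daily * days, 2)

# Example: 2M input + 500k output tokens/day at placeholder rates.
hosted = monthly_cost_usd(2_000_000, 500_000, 0.005, 0.015)
```

Running the same volumes against self-hosted GPU amortization costs is what usually decides the hosted-vs-OSS question in the Llama/vLLM rows above.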
⚙️ Orchestration Framework
The "nervous system": agent coordination, workflow, tool use, multi-agent patterns

💡 Decision principle: This is the FIRST architectural decision. It shapes everything else. LangGraph = stateful/cyclical agents. CrewAI = role-based teams. AutoGen = conversation-based multi-agent. Semantic Kernel = enterprise/.NET first.

LangGraph [OSS] (BOTH)
LangChain: stateful graph-based agent orchestration
Why: Best for complex, stateful agents with loops, branching, and human-in-the-loop. Built-in checkpointing, memory, streaming. The most production-ready OSS option.
Tradeoffs: ⚠ Steep learning curve. Verbose graph definition.
Complexity: HIGH · Alternatives: CrewAI, AutoGen

CrewAI [OSS] (MVP)
Role-based multi-agent crews
Why: Fastest path to a multi-agent MVP. Intuitive agent/task/crew abstraction. Great for sequential role-delegation pipelines. Demos well for clients.
Tradeoffs: ⚠ Less control over state. Less flexible for non-sequential flows.
Complexity: LOW · Alternatives: LangGraph, n8n

AutoGen v0.4 [OSS] (BOTH)
Microsoft: conversation-based multi-agent
Why: Best for coding agents, autonomous problem solving, and multi-agent debate patterns. The AgentChat API is clean. Strong Azure/Microsoft ecosystem alignment.
Tradeoffs: ⚠ Less structured workflow control than LangGraph.
Complexity: MED · Alternatives: LangGraph, CrewAI

Semantic Kernel [SDK, AZURE] (PROD)
Microsoft: enterprise SDK for AI orchestration
Why: Best choice if the client is a .NET/C# shop or a deep Azure tenant. Native Azure AI Foundry integration, enterprise memory patterns, plugin architecture.
Tradeoffs: ⚠ Python support lags .NET. Smaller community than LangChain.
Complexity: MED · Alternatives: LangGraph + Azure OpenAI

n8n [OSS] (MVP)
Low-code workflow automation with AI nodes
Why: Ideal for deterministic, operational automation (CRM sync, email triage, data pipelines). Non-dev stakeholders can edit flows. Fast POC delivery.
Tradeoffs: ⚠ Not suited for complex reasoning or autonomous agents. Not production-grade AI logic.
Complexity: LOW · Alternatives: Make.com, Zapier AI
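The "stateful/cyclical agents" pattern that LangGraph formalizes can be sketched in plain Python: nodes mutate a shared state dict, a router picks the next node, and every step is checkpointed so a run can resume or roll back. This is not LangGraph's API, just the shape of the pattern; the node names and acceptance rule are invented for illustration.

```python
# Plain-Python sketch of a stateful graph: draft -> review, loop until approved.
checkpoints = []

def draft(state):
    state["attempts"] += 1
    state["text"] = f"draft v{state['attempts']}"
    return state

def review(state):
    # Toy acceptance rule standing in for an LLM critique step.
    state["approved"] = state["attempts"] >= 2
    return state

def route(state):
    # Conditional edge: loop back to draft until the review passes.
    return "done" if state["approved"] else "draft"

def run(state):
    node = "draft"
    while node != "done":
        fn = {"draft": draft, "review": review}[node]
        state = fn(state)
        checkpoints.append(dict(state))  # snapshot per step, as a checkpointer would
        node = "review" if node == "draft" else route(state)
    return state

final = run({"attempts": 0, "approved": False, "text": ""})
```

The checkpoint list is what makes human-in-the-loop and rollback possible: any snapshot can be restored and the loop resumed from there.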
🔍 Vector Store / Semantic Search
Embedding storage, ANN search, retrieval backbone for RAG systems

💡 Decision principle: A vector DB handles semantic/fuzzy search; SQL handles exact/structured retrieval. The best RAG architectures use BOTH: vector search returns candidate IDs, then SQL resolves them to full structured records. Choose a vector DB based on data volume, filtering needs, and managed vs self-hosted preference.

ChromaDB [OSS] (MVP)
In-process or client-server vector DB
Why: Zero infrastructure to set up. Perfect for POC and MVP. Runs in-process in Python. Easy LangChain/LlamaIndex integration.
Tradeoffs: ⚠ Not production-grade at scale. Limited metadata filtering. No managed cloud offering.
Scale: LOW (<1M vectors) · Alternatives: FAISS, Qdrant

Qdrant [OSS] (BOTH)
Rust-based, high-performance vector DB
Why: The best OSS option for production. Rich payload filtering, sparse+dense hybrid search, cloud and self-hosted. Docker-ready, fast.
Tradeoffs: ⚠ Smaller ecosystem than Pinecone. Needs infra management if self-hosted.
Scale: MED (millions) · Alternatives: Weaviate, Pinecone

Pinecone [Managed SaaS] (PROD)
Fully managed, serverless vector DB
Why: Zero ops. Serverless pricing model. Strong enterprise support. Best for teams without MLOps capacity who need reliable prod vector search.
Tradeoffs: ⚠ Vendor lock-in. Data leaves your infra. Cost at scale. No SQL-style joins.
Scale: HIGH (billions) · Alternatives: Weaviate Cloud, Qdrant Cloud

Azure AI Search [Managed, AZURE] (PROD)
Cognitive Search + vector indexing on Azure
Why: Best choice for an Azure-native stack. Hybrid search (keyword + vector), integrated with Azure OpenAI, Cosmos DB, Blob Storage. Enterprise SLA.
Tradeoffs: ⚠ Azure lock-in. Higher cost than OSS. Slower feature velocity.
Scale: HIGH (enterprise) · Alternatives: Qdrant + AKS

Weaviate [OSS] (PROD)
GraphQL API, multimodal, hybrid search
Why: Best for multimodal (text + image) retrieval. Built-in vectorizer modules, GraphQL API, object-level permissions. Strong enterprise roadmap.
Tradeoffs: ⚠ Higher resource usage. GraphQL adds a learning curve.
Scale: HIGH · Alternatives: Qdrant, Pinecone

pgvector [OSS Extension] (MVP)
Postgres extension for vector search
Why: Keep everything in one DB (Postgres). No extra infra. Perfect when data volume is modest and you want SQL + vector in one query. Supabase includes it.
Tradeoffs: ⚠ Slower ANN at large scale than dedicated vector DBs. Not purpose-built.
Scale: LOW (<500k vectors) · Alternatives: ChromaDB, Qdrant
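The "vector search returns candidate IDs, SQL resolves full records" principle can be sketched end to end with only the standard library: a brute-force cosine search stands in for the vector DB, and SQLite stands in for the structured store. The documents, embeddings, and table schema are invented for illustration; a real system would swap the loop for Qdrant, pgvector, or similar.

```python
import math
import sqlite3

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "vector store": doc_id -> tiny 3-d stand-ins for real embeddings.
docs = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
}

# Structured store holding the full records.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany("INSERT INTO docs VALUES (?, ?, ?)", [
    (1, "Refund policy", "Refunds within 30 days."),
    (2, "Returns", "Return unused items."),
    (3, "Shipping", "Ships in 2 days."),
])

def retrieve(query_vec, k=2):
    """Vector search for candidate IDs, then SQL for the full rows."""
    ranked = sorted(docs, key=lambda i: cosine(query_vec, docs[i]), reverse=True)
    ids = ranked[:k]
    marks = ",".join("?" * len(ids))
    return db.execute(
        f"SELECT id, title, body FROM docs WHERE id IN ({marks})", ids
    ).fetchall()

hits = retrieve([1.0, 0.05, 0.0])
```

The two-step shape is the point: the vector layer only has to return IDs, so the structured layer keeps sole ownership of the authoritative records.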
🗄️ Database: Structured Storage
Relational, NoSQL, graph, and time-series storage for operational data

SQLite [OSS Embedded] (MVP)
Embedded, serverless relational DB
Why: Zero setup. File-based. Perfect for agent memory, chat history, and structured episodic memory stores in an MVP. Pairs well with ChromaDB.
Tradeoffs: ⚠ No concurrent writes. Not for multi-user prod.
Use-case fit: agent memory, local dev · Alternatives: PostgreSQL

PostgreSQL [OSS] (BOTH)
Gold-standard relational DB + pgvector
Why: The best all-round choice. ACID, JSON support, the pgvector extension, mature ecosystem. If in doubt, choose Postgres. Scales to most production workloads.
Tradeoffs: ⚠ Needs ops at scale. Not natively globally distributed.
Use-case fit: everything structured · Alternatives: MySQL, SQLite

Azure Cosmos DB [Managed, AZURE] (PROD)
Globally distributed NoSQL + vector search (preview)
Why: Best for globally distributed, multi-region Azure deployments. Multiple APIs (SQL, MongoDB, Cassandra). NoSQL for flexible schemas, and it now has vector search.
Tradeoffs: ⚠ Expensive. The RU pricing model is confusing. Azure lock-in.
Use-case fit: global chatbots, IoT, sessions · Alternatives: MongoDB Atlas, DynamoDB

MongoDB Atlas [Managed SaaS] (BOTH)
Managed document DB with vector search
Why: Great for flexible schemas (chat history, agent state, unstructured docs). Atlas Vector Search means no separate vector DB is needed at moderate scale.
Tradeoffs: ⚠ Not relational; joins are painful. Cost at scale.
Use-case fit: chat history, flexible records · Alternatives: Firestore, Cosmos DB

Neo4j [OSS] (PROD)
Graph DB: knowledge graphs, entity relations
Why: Best for relationship-heavy data: knowledge graphs, ontologies, recommendation engines. GraphRAG uses Neo4j as a graph memory store. Cypher query language.
Tradeoffs: ⚠ Niche use case. Steep Cypher learning curve. Higher ops overhead.
Use-case fit: GraphRAG, knowledge base · Alternatives: Amazon Neptune, TigerGraph

Redis / Upstash [OSS Cache] (PROD)
In-memory key-value + vector store
Why: Semantic cache for LLM responses (a major cost saver). Session storage, rate limiting, real-time pub/sub. Redis Stack adds vector search.
Tradeoffs: ⚠ Not a primary DB. Memory-bound. Persistence requires configuration.
Use-case fit: caching, sessions, rate limits · Alternatives: Memcached, DynamoDB
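The "SQLite for agent memory / chat history" fit above amounts to one append-only table of turns, queryable per session. A minimal sketch, with an invented schema and session names:

```python
import sqlite3

# One append-only log of chat turns; the schema is a sketch, not a standard.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE turns (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session TEXT NOT NULL,
    role TEXT CHECK(role IN ('user', 'assistant')),
    content TEXT NOT NULL,
    ts DATETIME DEFAULT CURRENT_TIMESTAMP)""")

def log_turn(session, role, content):
    db.execute("INSERT INTO turns (session, role, content) VALUES (?, ?, ?)",
               (session, role, content))

def history(session, limit=20):
    """Last `limit` turns of a session, oldest-first for prompt assembly."""
    rows = db.execute(
        "SELECT role, content FROM turns WHERE session = ? ORDER BY id DESC LIMIT ?",
        (session, limit)).fetchall()
    return list(reversed(rows))

log_turn("s1", "user", "What is our refund window?")
log_turn("s1", "assistant", "30 days.")
log_turn("s2", "user", "unrelated session")
```

Swapping the `connect` target from `":memory:"` to a file path is the entire persistence story in an MVP, which is why this pairs so well with an in-process vector store like ChromaDB.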
🧩 Agent Memory Architecture
Short-term, long-term, episodic, semantic, and procedural memory for AI agents

💡 Decision principle: Memory evolves: V1 = static lookup → V2 = agentic retrieval → V3 = multi-source integration → V4 = background self-updating memory. Match complexity to actual need. Most MVPs need only V1-V2.

In-context Window [OSS] (MVP)
Pass full history in the system prompt
Memory type: working memory
Why: Zero implementation. Works for short sessions. Sufficient for most chatbot MVPs. Claude/GPT-4o 128k+ contexts make this viable for longer.
Tradeoffs: ⚠ Context saturation. Token cost scales linearly. No persistence.
Complexity: MINIMAL · Alternatives: summarization buffer

LangChain Memory Buffers [OSS] (MVP)
ConversationBufferMemory, SummaryMemory
Memory type: short-term + summary
Why: Easy plug-in memory for LangChain chains. SummaryMemory compresses old turns, solving the token-budget problem. Backed by any DB.
Tradeoffs: ⚠ The legacy memory classes are deprecated as of LangChain v0.3+; moving to LangGraph is the preferred path.
Complexity: LOW · Alternatives: LangGraph checkpointer

LangGraph Checkpointer [OSS] (BOTH)
Built-in state persistence for LangGraph agents
Memory type: working + episodic
Why: Native state snapshot per turn. Supports resume, rollback, and human-in-the-loop. Backends: SQLite (dev), PostgreSQL/Redis (prod). The most production-ready pattern.
Tradeoffs: ⚠ LangGraph-specific. Adds graph-definition overhead.
Complexity: MED · Alternatives: AutoGen state, custom DB

LangMem / LangGraph Store [OSS] (BOTH)
Long-term semantic memory SDK for LangGraph
Memory type: semantic + episodic
Why: A cognitive memory model (semantic, episodic, procedural) baked into LangGraph. Cross-thread memory persistence. The best structured approach to agent long-term memory.
Tradeoffs: ⚠ Relatively new. Docs still maturing. LangGraph dependency.
Complexity: MED · Alternatives: Letta/MemGPT, mem0

Letta / MemGPT [OSS] (PROD)
Paged memory OS for LLM agents
Memory type: full cognitive model
Why: The most advanced open-source agent memory system. Paged memory (core/archival/recall), self-editing memory, multi-agent support. Ideal for long-running personal AI agents.
Tradeoffs: ⚠ Complex setup. Opinionated architecture. Smaller community.
Complexity: HIGH · Alternatives: LangMem, mem0

mem0 [SaaS] (PROD)
Managed memory layer for AI apps
Memory type: semantic long-term memory
Why: A managed service with no infra to run. Automatically extracts, stores, and retrieves memories across conversations. Good for SaaS AI products needing user-level personalization.
Tradeoffs: ⚠ SaaS cost. Data leaves your infra. Early-stage product.
Complexity: LOW · Alternatives: Letta, LangMem
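The buffer-plus-summary pattern behind SummaryMemory is simple to sketch: keep the last N turns verbatim and fold evicted turns into a running summary. Here `summarize()` is a deliberate stand-in for an LLM summarization call, and the turn limit is an invented constant.

```python
# Keep the last MAX_TURNS turns verbatim; compress older ones into a summary.
MAX_TURNS = 4

def summarize(summary, evicted):
    # Placeholder for an LLM call that compresses evicted turns;
    # here it just concatenates their contents.
    return (summary + " " + " ".join(t["content"] for t in evicted)).strip()

def add_turn(memory, role, content):
    memory["turns"].append({"role": role, "content": content})
    if len(memory["turns"]) > MAX_TURNS:
        evicted = memory["turns"][:-MAX_TURNS]
        memory["turns"] = memory["turns"][-MAX_TURNS:]
        memory["summary"] = summarize(memory["summary"], evicted)
    return memory

mem = {"summary": "", "turns": []}
for i in range(6):
    add_turn(mem, "user", f"turn{i}")
```

The prompt is then assembled as summary + recent turns, which is how this pattern caps token cost while keeping older context recoverable.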
📦 Data Strategy: Ingestion & ETL
Document parsing, chunking, embedding, pipelines, and data connectors for RAG

LlamaIndex [OSS] (BOTH)
Data framework for LLMs: parsing, indexing, querying
Role: RAG framework
Why: The best dedicated RAG toolkit. 160+ data connectors, advanced chunking strategies, query engines, reranking. Complements LangGraph for data-heavy RAG apps.
Tradeoffs: ⚠ Can be complex to configure. Some overlap with LangChain.
Complexity: MED · Alternatives: LangChain loaders

Unstructured.io [OSS] (BOTH)
Document parsing: PDF, Word, HTML, images
Role: doc parser
Why: Best-in-class for extracting clean text from messy documents. Handles tables, headers, and images in PDFs. Open-source core plus a managed API for scale.
Tradeoffs: ⚠ Managed API costs. Complex docs need tuning.
Complexity: LOW · Alternatives: Azure Document Intelligence, Docling

Azure Document Intelligence [AZURE] (PROD)
OCR + layout analysis + form extraction
Role: doc parser
Why: Enterprise-grade structured document extraction (invoices, forms, contracts). Prebuilt models for common document types. Tight Azure ecosystem integration.
Tradeoffs: ⚠ Pay-per-page pricing. Azure lock-in.
Complexity: MED · Alternatives: Unstructured, Textract

OpenAI / Azure Embeddings [SaaS] (BOTH)
text-embedding-3-small/large
Role: embeddings
Why: State-of-the-art embedding quality. text-embedding-3-small offers the best cost/quality tradeoff. Critical: embed queries and documents with the SAME model.
Tradeoffs: ⚠ Per-token cost. Data leaves your infra. Changing models requires re-embedding.
Complexity: LOW · Alternatives: Cohere, BGE, E5

Apache Airflow / Prefect [OSS] (PROD)
Workflow orchestration for data pipelines
Role: pipeline orchestration
Why: Schedule and monitor embedding-refresh pipelines. Airflow for complex DAGs; Prefect for simpler Python-native flows. Essential for keeping the vector store fresh.
Tradeoffs: ⚠ Infra overhead. Overkill for simple scheduled jobs.
Complexity: HIGH · Alternatives: Azure Data Factory, Dagster
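The chunking step these tools automate has a simple baseline: fixed-size windows with overlap, so sentences split at a boundary still appear whole in at least one chunk. A minimal sketch (the frameworks above add sentence-aware and token-aware splitting on top of this idea):

```python
# Naive fixed-size chunker with overlap between consecutive chunks.
def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Tiny example so the overlap is visible: neighbors share 2 characters.
pieces = chunk("abcdefghij", size=4, overlap=2)
```

Chunk size and overlap are tuning parameters: larger chunks preserve context but dilute relevance scores, while overlap costs extra embedding tokens in exchange for boundary robustness.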
🖥️ UI / Frontend
Chat interfaces, dashboards, admin panels, streaming UX

Streamlit [OSS] (MVP)
Python-native rapid UI for data apps
Why: Fastest time-to-demo for Python AI apps. Built-in chat components, file upload, streaming support. Perfect for internal tools and client POCs.
Tradeoffs: ⚠ Not production-grade. Limited customization. Not for customer-facing apps.
Complexity: LOW · Alternatives: Gradio, Chainlit

Chainlit [OSS] (MVP)
Chat UI framework built for LLM apps
Why: Purpose-built for chatbot UIs. Step-by-step agent-reasoning display, file attachments, streaming, auth. Best for quickly shipping a polished chat interface.
Tradeoffs: ⚠ Less flexible than a full React app. Python-only backend.
Complexity: LOW · Alternatives: Streamlit, Open WebUI

Next.js + Vercel AI SDK [OSS] (BOTH)
React framework with streaming AI hooks
Why: Best for customer-facing AI products. The Vercel AI SDK handles SSE streaming, useChat/useCompletion hooks, and model switching. Production-ready; polished UX is achievable.
Tradeoffs: ⚠ Requires frontend dev skills. More setup than Streamlit/Chainlit.
Complexity: MED · Alternatives: Remix, SvelteKit

React + FastAPI [OSS] (PROD)
Decoupled frontend + AI backend
Why: The most flexible production architecture. FastAPI serves WebSocket/SSE streaming from the Python agent; React consumes it. Full control, full customization.
Tradeoffs: ⚠ Most engineering effort. Needs both Python and JS/TS skills.
Complexity: HIGH · Alternatives: Next.js + FastAPI, Django

Open WebUI [OSS] (MVP)
Self-hosted ChatGPT-like interface
Why: An instant ChatGPT-like UI for any OpenAI-compatible API. Docker deploy. Supports multiple models, RAG, web browsing. Zero UI development needed.
Tradeoffs: ⚠ Hard to customize deeply. Better suited for internal tools.
Complexity: MINIMAL · Alternatives: Chatbot UI, LibreChat
⚡ Backend / API Layer
REST/WebSocket API servers, streaming, auth, rate limiting

FastAPI [OSS] (BOTH)
Modern async Python REST + WebSocket server
Why: The default choice for Python AI backends. Async-native for streaming LLM responses. Auto-generated OpenAPI docs. SSE support built in. Uvicorn/Gunicorn for prod.
Tradeoffs: ⚠ The Python GIL limits true parallelism; you need a separate worker-scaling strategy.
Complexity: LOW · Alternatives: Django REST, Flask

Azure Functions [AZURE] (PROD)
Serverless compute for event-driven AI logic
Why: Best for event-driven tasks: webhook handlers, real-time voice pipeline steps, scheduled embedding refresh. Pay per execution. Tight Azure integration.
Tradeoffs: ⚠ Cold-start latency. Execution time limits. Azure vendor lock-in.
Complexity: MED · Alternatives: AWS Lambda, Google Cloud Run

Azure API Management [AZURE] (PROD)
API gateway with rate limiting, auth, throttling
Why: Enterprise API gateway: per-user/key rate limiting, token quota management, load balancing across Azure OpenAI PTU pools, auth, analytics. Essential for multi-tenant AI APIs.
Tradeoffs: ⚠ Complex setup. Azure-only. Licensing cost.
Complexity: HIGH · Alternatives: Kong, AWS API Gateway
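The SSE streaming mentioned in the FastAPI row has a fixed wire format: each chunk is a `data:` line followed by a blank line, conventionally ended with a `[DONE]` sentinel (the OpenAI streaming convention). A minimal framing sketch, with the `delta` payload key chosen for illustration:

```python
import json

def sse_stream(chunks):
    """Wrap text chunks in server-sent-events framing."""
    for chunk in chunks:
        yield f"data: {json.dumps({'delta': chunk})}\n\n"
    yield "data: [DONE]\n\n"

# In FastAPI this generator would be wrapped in a StreamingResponse
# with media_type="text/event-stream"; here we just materialize it.
events = list(sse_stream(["Hel", "lo"]))
```

Because each event is self-delimited by the blank line, the browser's `EventSource` (or the Vercel AI SDK's hooks) can render tokens as they arrive instead of waiting for the full completion.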
🚀 Containerization & Deployment
Docker, orchestration, CI/CD, cloud deployment targets

Docker + Compose [OSS] (BOTH)
Container runtime + local multi-service orchestration
Why: The universal standard. Compose for local dev with multiple services (API + vector DB + Redis). The same image promotes from dev to prod. No team should ship without this.
Tradeoffs: ⚠ Not for production orchestration at scale; use K8s or managed containers.
Complexity: LOW · Alternatives: Podman

Azure Container Apps [AZURE] (PROD)
Serverless K8s-based container hosting
Why: The best managed container platform on Azure. Auto-scaling to zero, KEDA-based event scaling, Dapr integration. Much simpler than AKS for most AI app deployments.
Tradeoffs: ⚠ Less control than AKS. Not for stateful workloads or GPU inference.
Complexity: MED · Alternatives: Azure App Service, AKS

AKS (Azure Kubernetes Service) [AZURE] (PROD)
Managed Kubernetes: full control
Why: Required for GPU workloads (self-hosted LLMs), complex microservice AI architectures, custom networking/security requirements, or high-throughput production AI APIs.
Tradeoffs: ⚠ High ops complexity. Requires K8s expertise. Cost.
Complexity: HIGH · Alternatives: GKE, EKS

Railway / Render [SaaS PaaS] (MVP)
Simple PaaS for containerized apps
Why: Deploy FastAPI + Postgres + Redis in minutes. No Kubernetes. Git-push deploys. Best for rapid MVP delivery when infra is not the focus.
Tradeoffs: ⚠ Not enterprise-grade. Limited compliance controls. Vendor dependency.
Complexity: LOW · Alternatives: Fly.io, Heroku

GitHub Actions [SaaS] (BOTH)
CI/CD pipeline automation
Why: The standard CI/CD for most projects: build → test → push Docker image → deploy to a container platform. Free for public repos; generous free tier for private ones.
Tradeoffs: ⚠ Complex pipelines become unwieldy. Azure DevOps is better for deep Azure integration.
Complexity: LOW · Alternatives: Azure DevOps, GitLab CI
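The "API + vector DB + Redis" local stack from the Docker + Compose row can be sketched as a Compose file. Service names, ports, and the build context are assumptions for illustration; only the `qdrant/qdrant` and `redis:7` images are published upstream defaults.

```yaml
# Hypothetical local dev stack: FastAPI app + Qdrant + Redis.
services:
  api:
    build: .                  # assumes a Dockerfile for the FastAPI app
    ports:
      - "8000:8000"
    environment:
      QDRANT_URL: http://qdrant:6333
      REDIS_URL: redis://redis:6379
    depends_on:
      - qdrant
      - redis
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
  redis:
    image: redis:7
```

Compose's service names double as DNS hostnames on the shared network, which is why the API reaches its dependencies as `qdrant` and `redis` rather than `localhost`.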
📊 Logging, Monitoring & Observability
LLM tracing, cost tracking, latency monitoring, error alerting

💡 Decision principle: AI observability ≠ traditional APM. You need LLM-specific tracing (prompt/response capture, token cost per run, chain-step visibility). Add LangSmith or LangFuse early; you'll regret not having it during production debugging sessions.

LangSmith [SaaS] (BOTH)
LangChain's LLM observability platform
Focus: LLM tracing
Why: Best in class for LangChain/LangGraph apps. Auto-traces every chain/agent step, showing prompt/response, token cost, and latency per node. Essential for debugging agent loops.
Tradeoffs: ⚠ SaaS cost at scale. LangChain ecosystem only (though the SDK is broader).
Complexity: LOW · Alternatives: LangFuse, Arize

LangFuse [OSS] (BOTH)
Open-source LLM observability, self-hostable
Focus: LLM tracing
Why: Framework-agnostic LLM observability. Self-hostable (data sovereignty). Covers traces, evals, datasets, prompt management. The best OSS alternative to LangSmith.
Tradeoffs: ⚠ Self-hosting adds ops. Smaller community than LangSmith.
Complexity: MED · Alternatives: LangSmith, Helicone

Azure Monitor + App Insights [AZURE] (PROD)
Full-stack Azure observability platform
Focus: platform monitoring
Why: Unified logs, metrics, and traces for Azure-hosted apps. KQL queries for log analysis. Custom dashboards, alerting, distributed tracing. Required for enterprise Azure deployments.
Tradeoffs: ⚠ Not LLM-specific. KQL learning curve. Cost scales with data volume.
Complexity: MED · Alternatives: Datadog, Grafana stack

Prometheus + Grafana [OSS] (PROD)
Metrics collection + visualization stack
Focus: infra metrics
Why: The standard OSS metrics stack. Instrument FastAPI/vLLM with Prometheus exporters. Grafana dashboards for throughput, latency, token rates, queue depths, GPU utilization.
Tradeoffs: ⚠ Infra overhead. Not LLM-aware out of the box; requires custom metrics.
Complexity: HIGH · Alternatives: Datadog, Azure Monitor

Helicone [SaaS] (PROD)
LLM cost + usage analytics proxy
Focus: cost tracking
Why: A drop-in proxy for OpenAI/Anthropic APIs. Tracks cost per user/session, caches responses, rate-limits. Excellent for multi-tenant SaaS AI products where per-user cost matters.
Tradeoffs: ⚠ Sits in the request path, adding latency. SaaS dependency.
Complexity: LOW · Alternatives: LangFuse, OpenLLMetry
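The per-user cost accounting a proxy like Helicone provides reduces to a small amount of bookkeeping: accumulate token usage per user key and price it. A sketch with placeholder rates (not real vendor pricing) and an invented user key:

```python
from collections import defaultdict

# Illustrative per-1K-token rates; NOT real vendor pricing.
PRICE_PER_1K = {"input": 0.005, "output": 0.015}

usage = defaultdict(lambda: {"input": 0, "output": 0})

def record(user: str, input_tokens: int, output_tokens: int):
    """Accumulate token usage per user, as a cost-tracking proxy would."""
    usage[user]["input"] += input_tokens
    usage[user]["output"] += output_tokens

def cost(user: str) -> float:
    """Price a user's accumulated usage."""
    u = usage[user]
    return round(u["input"] / 1000 * PRICE_PER_1K["input"]
                 + u["output"] / 1000 * PRICE_PER_1K["output"], 4)

record("alice", 1200, 300)
record("alice", 800, 200)
```

In a multi-tenant product this same ledger is what drives per-user quotas and billing alerts, which is why it belongs in the request path from day one.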
🎙️ Voice AI Stack
STT, TTS, real-time voice pipelines, telephony integration

💡 Decision principle: Target under 500ms perceived end-to-end latency. Real-time voice means WebSocket/WebRTC throughout (no HTTP polling), and the STT → LLM → TTS pipeline must support streaming at every stage. For telephony: ACS Call Automation on Azure, Twilio elsewhere.

Whisper / Azure Speech STT [OSS] (BOTH)
Speech-to-text transcription
Layer: STT
Why: Whisper (OSS): best accuracy, 100+ languages, self-hostable. Azure Speech: managed, streaming, real-time, enterprise SLA. Use Azure for production voice pipelines on the Azure stack.
Tradeoffs: ⚠ Whisper is not real-time by default. Azure costs per audio hour.
Complexity: MED · Alternatives: Deepgram, AssemblyAI

Deepgram [SaaS] (BOTH)
Real-time STT with ultra-low latency
Layer: STT (real-time)
Why: Best in class for real-time streaming STT. ~300ms latency. WebSocket API. Significantly faster than Azure Speech for live voice-agent use cases.
Tradeoffs: ⚠ SaaS cost. Data leaves your infra. Per-minute pricing.
Complexity: LOW · Alternatives: Azure Speech, AssemblyAI

ElevenLabs / Azure TTS [SaaS] (BOTH)
Text-to-speech synthesis
Layer: TTS
Why: ElevenLabs: the most natural voice quality, streaming TTS, voice cloning. Azure TTS: enterprise-grade, 400+ voices, Azure integration, Neural TTS. Choose based on naturalness vs compliance needs.
Tradeoffs: ⚠ Per-character cost. Voice cloning raises ethical/legal issues.
Complexity: LOW · Alternatives: OpenAI TTS, PlayHT

Azure ACS + Call Automation [AZURE] (PROD)
Telephony + real-time voice pipeline on Azure
Layer: telephony
Why: Enterprise telephony integration (PSTN, SIP). The Call Automation API enables programmatic call control, real-time transcription, and media streaming to Azure Functions. ART Accelerator pattern.
Tradeoffs: ⚠ Complex setup (ACS + Functions + Event Grid architecture). Azure-only.
Complexity: HIGH · Alternatives: Twilio, Vonage

OpenAI Realtime API [SaaS] (BOTH)
End-to-end real-time voice (GPT-4o)
Layer: full voice pipeline
Why: A single WebSocket API covering STT + LLM + TTS in one round trip. Dramatically simplifies voice architecture. Voice activity detection included. Best for MVP voice agents.
Tradeoffs: ⚠ Expensive. Less control over individual pipeline stages. OpenAI lock-in.
Complexity: LOW · Alternatives: LiveKit, Daily.co + custom
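The 500ms budget above is worth checking stage by stage before committing to a vendor mix. A back-of-envelope sketch; the per-stage numbers are illustrative assumptions, not vendor benchmarks:

```python
# End-to-end voice latency is roughly the sum of time-to-first-output
# at each streaming stage: STT final result, LLM first token, TTS first audio.
BUDGET_MS = 500

def pipeline_latency(stages: dict[str, int]) -> int:
    """Total perceived latency as the sum of per-stage contributions."""
    return sum(stages.values())

stages = {
    "stt_final_ms": 150,        # e.g. a streaming STT provider
    "llm_first_token_ms": 200,  # time to first token, not full completion
    "tts_first_audio_ms": 120,  # time to first audio chunk
}
total = pipeline_latency(stages)
within_budget = total <= BUDGET_MS
```

The key modeling choice is using time-to-first-output per stage: with streaming end to end, later tokens and audio overlap with playback, so only the first-chunk latencies stack.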
🔐 Security, Auth & Guardrails
Identity, access control, prompt-injection protection, output guardrails

Auth0 / Azure AD B2C [SaaS] (BOTH)
Identity and access management
Layer: auth/identity
Why: Auth0: fastest MVP auth on any stack, social logins, MFA. Azure AD B2C: enterprise identity for Azure-native apps, SAML/OIDC, conditional access. Don't build auth from scratch.
Tradeoffs: ⚠ Auth0 cost at scale. B2C has complex configuration. Vendor dependency.
Complexity: LOW · Alternatives: Clerk, Supabase Auth

Azure Key Vault [AZURE] (BOTH)
Secrets, keys, and certificate management
Layer: secrets management
Why: Never store API keys in code or env files for client deployments. Key Vault + Managed Identity gives a zero-credential access pattern. Required for enterprise Azure deployments.
Tradeoffs: ⚠ Azure-specific. Adds latency if not cached.
Complexity: LOW · Alternatives: HashiCorp Vault, AWS Secrets Manager

Guardrails AI / NeMo Guardrails [OSS] (PROD)
Output validation and prompt safety rails
Layer: LLM safety
Why: Validate LLM outputs against schemas, with PII detection, topic restrictions, and hallucination checks. NeMo Guardrails is NVIDIA's rail framework, using the Colang language for policy definition.
Tradeoffs: ⚠ Adds latency per call. Config overhead. False positives on edge cases.
Complexity: MED · Alternatives: Azure Content Safety, Rebuff

Azure Content Safety [AZURE] (PROD)
Harmful-content detection API
Layer: content moderation
Why: A managed API for detecting hate speech, violence, and sexual content in both inputs and outputs. Prompt Shields for jailbreak/prompt-injection detection. Required for public-facing Azure AI apps.
Tradeoffs: ⚠ Per-call cost. Azure lock-in. Added latency.
Complexity: LOW · Alternatives: Guardrails AI, OpenAI moderation
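The PII-detection rail that Guardrails AI and NeMo Guardrails formalize can be illustrated as a check that runs on every response before it reaches the user. The two regexes below are toy examples and far from production-complete; real rails combine pattern matching with schema validation and model-based classifiers.

```python
import re

# Illustrative PII patterns only; production rails need far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def check_output(text: str) -> list[str]:
    """Return the PII rule names the text violates (empty list = pass)."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

ok = check_output("Your ticket has been escalated.")
bad = check_output("Contact john.doe@example.com or SSN 123-45-6789.")
```

A failing check typically triggers redaction or a regeneration loop rather than a hard error, which is where the per-call latency cost noted above comes from.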